Do Generalisation Results Generalise?
Boglioni, Matteo, Sgobbi, Andrea, Tavernini, Gabriel, Rita, Francesco, Mosbach, Marius, Pimentel, Tiago
A large language model's (LLM's) out-of-distribution (OOD) generalisation ability is crucial to its deployment. Previous work assessing LLMs' generalisation performance, however, typically focuses on a single out-of-distribution dataset. This approach may fail to precisely evaluate the capabilities of the model, as the data shifts encountered once a model is deployed are much more diverse. In this work, we investigate whether OOD generalisation results generalise. More specifically, we evaluate a model's performance across multiple OOD testsets throughout a finetuning run; we then compute the partial correlation of performances across these testsets, regressing out in-domain performance. This allows us to assess how correlated generalisation performances are once in-domain performance is controlled for. Analysing OLMo2 and OPT, we observe no overarching trend in generalisation results: the existence of a positive or negative correlation between any two OOD testsets depends strongly on the specific choice of model analysed.
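The partial-correlation analysis described in the abstract can be sketched in a few lines: regress each test set's performance curve on the in-domain curve, then correlate the residuals. This is a minimal illustration of the idea only, not the authors' actual analysis code (function names and the single-regressor setup are assumptions):

```python
import math

def partial_correlation(x, y, z):
    """Partial correlation of x and y, controlling for z: correlate the
    residuals of regressing each of x and y on z (with intercept)."""
    def residuals(a, b):
        # Least-squares residuals of regressing a on b.
        n = len(a)
        mb, ma = sum(b) / n, sum(a) / n
        cov = sum((bi - mb) * (ai - ma) for ai, bi in zip(a, b))
        var = sum((bi - mb) ** 2 for bi in b)
        slope = cov / var
        return [ai - (ma + slope * (bi - mb)) for ai, bi in zip(a, b)]
    rx, ry = residuals(x, z), residuals(y, z)
    num = sum(a * b for a, b in zip(rx, ry))
    den = math.sqrt(sum(a * a for a in rx) * sum(b * b for b in ry))
    return num / den
```

Here `x` and `y` would be two OOD testsets' accuracy curves over a finetuning run, and `z` the in-domain curve being regressed out.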
Submission 180: Author Response
We thank the reviewers for their thoughtful comments. Reviewers have described our work as "extremely important in that it provides a reality check for [...]". Reviewers' comments have been paraphrased for brevity.
- R3: It looks like the random image regularizer hurts in-domain performance.
- R3: Do other VQA datasets (e.g., GQA, VCR) have the same problem?
- R2: Do other datasets for OOD evaluation have similar problems like VQA-CP?
Proximal Supervised Fine-Tuning
Zhu, Wenhong, Xie, Ruobing, Wang, Rui, Sun, Xingwu, Wang, Di, Liu, Pengfei
Supervised fine-tuning (SFT) of foundation models often leads to poor generalization, where prior capabilities deteriorate after tuning on new tasks or domains. Inspired by trust-region policy optimization (TRPO) and proximal policy optimization (PPO) in reinforcement learning (RL), we propose Proximal SFT (PSFT), a fine-tuning objective that incorporates the benefits of a trust region, effectively constraining policy drift during SFT while maintaining competitive tuning performance. By viewing SFT as a special case of policy gradient methods with constant positive advantages, we derive PSFT, which stabilizes optimization and improves generalization, while leaving room for further optimization in subsequent post-training stages. Experiments across mathematical and human-value domains show that PSFT matches SFT in-domain, outperforms it in out-of-domain generalization, remains stable under prolonged training without causing entropy collapse, and provides a stronger foundation for subsequent optimization. Recently, post-training has become a crucial part of the overall training process. In particular, reinforcement learning (RL) algorithms, such as PPO (Schulman et al., 2017) and GRPO (Shao et al., 2024), have demonstrated significant effectiveness when applied to language models (LMs) focused on reasoning tasks. As RL is scaled over time, foundation models gain the capacity to address complex problems through more profound and extended reasoning (OpenAI, 2024; Guo et al., 2025). These reasoning models offer abundant and valuable latent thoughts (Ruan et al., 2025) across the internet.
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.91)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.89)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.89)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
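Viewing SFT as policy gradient with a constant advantage of 1, the trust-region idea in the abstract can be sketched as a PPO-style clipped objective over the token probability ratio. The function below is a hedged per-token illustration only; the name, the clipping threshold, and the exact form are assumptions, not the authors' implementation:

```python
import math

def psft_token_objective(logp_new, logp_old, eps=0.2):
    # Probability ratio between the current policy and the frozen
    # pre-update (reference) policy for one target token.
    ratio = math.exp(logp_new - logp_old)
    # PPO-style clipped surrogate with constant advantage A = 1:
    # the objective stops growing once the ratio exceeds 1 + eps,
    # constraining policy drift during fine-tuning.
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return -min(ratio, clipped)  # negated, so lower is better
```

When `logp_new == logp_old` the ratio is 1 and clipping is inactive; taking `eps` very large recovers an unclipped, vanilla policy-gradient view of SFT.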
LARGO: Low-Rank Regulated Gradient Projection for Robust Parameter Efficient Fine-Tuning
Zhang, Haotian, Liu, Liu, Yu, Baosheng, Qiu, Jiayan, Ren, Yanwei, Liu, Xianglong
The advent of parameter-efficient fine-tuning methods has significantly reduced the computational burden of adapting large-scale pretrained models to diverse downstream tasks. However, existing approaches often struggle to achieve robust performance under domain shifts while maintaining computational efficiency. To address this challenge, we propose the Low-rAnk Regulated Gradient Projection (LARGO) algorithm, which integrates dynamic constraints into low-rank adaptation methods. Specifically, LARGO incorporates parallel trainable gradient projections to dynamically regulate layer-wise updates, retaining the out-of-distribution robustness of the pretrained model while preserving inter-layer independence. Additionally, it ensures computational efficiency by mitigating the influence of gradient dependencies across layers during weight updates. Moreover, we incorporate an SVD-based initialization strategy that leverages the singular value decomposition of the pretrained weights for structured initialization, minimizing deviation from pretrained knowledge. Through extensive experiments on diverse benchmarks, LARGO achieves state-of-the-art performance across in-domain and out-of-distribution scenarios, demonstrating improved robustness under domain shifts with significantly lower computational overhead compared to existing PEFT methods. The source code will be released soon.
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
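The SVD-based initialization mentioned in the abstract can be sketched as follows: factor the pretrained weight and seed the low-rank adapter from its top singular directions, so training starts close to the pretrained solution. This is a hypothetical illustration under assumed names and an assumed even split of the singular values between the two factors:

```python
import numpy as np

def svd_adapter_init(W, rank):
    # Truncated SVD of the pretrained weight matrix W.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    root_s = np.sqrt(S[:rank])
    # Split the top-`rank` singular factors between the two adapter
    # matrices so that A @ B reproduces W's dominant subspace.
    A = U[:, :rank] * root_s            # shape (out_dim, rank)
    B = root_s[:, None] * Vt[:rank]     # shape (rank, in_dim)
    return A, B
```

With `rank` equal to the full rank of `W`, the product `A @ B` reconstructs `W` exactly; smaller ranks keep only the dominant directions, minimizing deviation from the pretrained weights for a given budget.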
Compressing Language Models for Specialized Domains
Williams, Miles, Chrysostomou, George, Jeronymo, Vitor, Aletras, Nikolaos
Compression techniques such as pruning and quantization offer a solution for more efficient deployment of language models (LMs), albeit with small drops in benchmark performance. However, general-purpose LM compression methods can negatively affect performance in specialized domains (e.g., biomedical or legal). Recent work has sought to address this, yet requires computationally expensive full-parameter fine-tuning. Instead, we propose cross-calibration, a novel training-free approach for improving the domain performance of compressed LMs. Our approach effectively leverages Hessian-based sensitivity to identify weights that are influential for both in-domain and general performance. Through extensive experimentation, we demonstrate that cross-calibration substantially outperforms existing approaches on domain-specific tasks, without compromising general performance. Notably, these gains come without additional computational overhead, displaying remarkable potential for extracting domain-specialized compressed models from general-purpose LMs.
- Law (1.00)
- Health & Medicine (1.00)
- Government (1.00)
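The selection step at the heart of the abstract can be sketched as combining two per-weight sensitivity scores when deciding what a pruning method may remove. This is a hypothetical illustration only: the function name, the max-combination, and the threshold rule are assumptions, and the actual scores would come from the paper's Hessian-based sensitivity rather than being given as arrays:

```python
import numpy as np

def cross_calibrated_keep_mask(sens_in_domain, sens_general, sparsity):
    # Treat a weight as influential if either score marks it so,
    # protecting both domain-specific and general capabilities.
    combined = np.maximum(sens_in_domain, sens_general)
    # Prune the `sparsity` fraction of weights with the lowest
    # combined sensitivity; keep the rest (mask True = keep).
    k = int(combined.size * sparsity)
    threshold = np.partition(combined.ravel(), k)[k]
    return combined >= threshold
```

The resulting boolean mask would then be handed to any standard pruning backend, which keeps the approach training-free.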
Review for NeurIPS paper: On the Value of Out-of-Distribution Testing: An Example of Goodhart's Law
Summary and Contributions: This paper provides an investigation of out-of-distribution generalization in visual question answering, as benchmarked by prior works on the VQA-CP dataset. The VQA-CP dataset by Agrawal et al. has different distributions in training and test, intentionally constructed to encourage models to truly perform reasoning and generalize better, instead of naively picking up on question-only biases in the dataset. However, the authors demonstrate how several prior works on VQA-CP have (inadvertently) gamed this evaluation dataset without necessarily making progress, by exploiting knowledge of how the train/test splits were constructed to build models that a) are conditioned on the question prefix (and so will only work well on the VQA-CP test set and will not generalize beyond it), or b) poorly fit the training set. Next, the authors provide a few naive baselines that exploit the aforementioned issues (and, as the authors acknowledge, are not useful for any practical purposes) yet perform well on the VQA-CP test set -- 1) a random-predictions model that inverts the predicted answer distribution from training to test, and 2) a learned BUTD model that artificially ignores the top-predicted answer on the VQA-CP test set. The fact that the inverted random-predictions model performs better on number and yes/no questions -- the question set that constitutes the largest fraction of performance -- is alarming, and provides a necessary and timely check on prior works on VQA-CP.
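The "inverted" random-predictions baseline the review describes can be sketched simply: estimate the answer prior from the training split, then predict in inverse proportion to it at test time, exploiting VQA-CP's deliberately flipped train/test priors. A hypothetical illustration, not the paper's exact construction:

```python
def inverted_answer_prior(train_answer_counts):
    # Empirical answer prior from the training split.
    total = sum(train_answer_counts.values())
    prior = {a: c / total for a, c in train_answer_counts.items()}
    # Invert: answers frequent in training become rare predictions.
    inv = {a: 1.0 - p for a, p in prior.items()}
    # Renormalize into a proper prediction distribution.
    z = sum(inv.values())
    return {a: w / z for a, w in inv.items()}
```

That such a content-blind baseline can score well on VQA-CP test is precisely the Goodhart's-law failure the paper highlights.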
Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in Dense Encoders
Lee, Hyunji, Soldaini, Luca, Cohan, Arman, Seo, Minjoon, Lo, Kyle
Prevailing research practice today often relies on training dense retrievers on existing large datasets such as MSMARCO and then experimenting with ways to improve zero-shot generalization to unseen domains. While prior work has tackled this challenge through resource-intensive steps such as data augmentation, architectural modifications, increasing model size, or even further base-model pretraining, comparatively little investigation has examined whether the training procedures themselves can be improved to yield better generalization in the resulting models. In this work, we recommend a simple recipe for training dense encoders: train on MSMARCO with parameter-efficient methods, such as LoRA, and opt for in-batch negatives unless given well-constructed hard negatives. We validate these recommendations on the BEIR benchmark and find that the results hold across the choice of dense encoder and base model size, and are complementary to other resource-intensive strategies for out-of-domain generalization such as architectural modifications or additional pretraining. We hope that this thorough and impartial study of various training techniques, which augments other resource-intensive methods, offers practical insights for developing a dense retrieval model that generalizes effectively, even when trained on a single dataset. Dense neural retrieval methods have proven generally effective in many Information Retrieval (IR) tasks (Karpukhin et al., 2020; Izacard et al., 2021; Ni et al., 2021a). These methods use learned neural encoders to obtain dense vector representations of text, and the relevance of a passage for a given query is estimated by computing the dot product between their encodings. Dense approaches can outperform traditional retrieval techniques (e.g., BM25 (Robertson & Jones, 1976)), as they estimate similarity beyond syntactic matching (Lin et al., 2022).
Neural retrieval models are effective rankers in domains for which large supervised datasets exist (e.g., MSMARCO (Campos et al., 2016) or Google NQ (Kwiatkowski et al., 2019)). Conversely, they may struggle to generalize to settings they have not been trained on, leading to challenges in handling out-of-domain tasks (Thakur et al., 2021a; Ren et al., 2022; Lupart et al., 2023). In most real-world applications, supervision data is not available; meanwhile, retrieval models play a key role in the nascent field of augmented language models across many exciting new scenarios (Mialon et al., 2023).
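The dot-product relevance scoring and the in-batch-negatives objective the recipe recommends can be sketched as a standard contrastive loss: each query's positive passage sits at the same batch index, and every other passage in the batch serves as a negative. A minimal numpy sketch of the idea, not the authors' training code:

```python
import numpy as np

def in_batch_negatives_loss(q_emb, p_emb):
    # Dot-product relevance between every query and every passage
    # in the batch: scores[i, j] = q_i . p_j.
    scores = q_emb @ p_emb.T
    # Log-softmax over passages; the matching passage for query i
    # is at column i, so all other columns act as negatives.
    scores = scores - scores.max(axis=1, keepdims=True)  # stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the diagonal (the positives).
    return -np.mean(np.diag(log_probs))
```

Because the negatives come for free from the rest of the batch, no separate hard-negative mining is required, which is exactly why the recipe defaults to in-batch negatives unless well-constructed hard negatives are available.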